CakePHP Valid XHTML/XML Behavior
If you have ever written a CMS-type application where you accept input from users to be stored as valid XHTML, you will probably have come up against some problems!
Generally this task is accomplished by using a JavaScript WYSIWYG editor on the client side in order to keep things simple for content editors, and the resulting markup is stored on the server. Often though, content editors tend to work in Microsoft Word and paste their content into the JavaScript editor. That’s where the fun begins! Windows uses its own character set (thanks Microsoft!) known as code page 1252. Whilst it is mostly compatible with the much more common Latin-1 character set, it differs in the range 0x80 to 0x9F, which is exactly where it keeps the curly quotes, dashes and euro sign that Word loves to insert. It is not something you generally want to use on the web; UTF-8 is a much more sensible way to go. If the content is to be stored in a database, you also need to ensure it matches the character set used by your table.
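To make the difference concrete, here is a minimal sketch (not part of the behaviour itself) of what happens to a pair of Word-style curly quotes. It assumes the mbstring extension is available, and the variable names are purely illustrative.

```php
<?php
// Hypothetical input: "Hello" wrapped in Word's curly double quotes,
// as raw code page 1252 bytes (0x93 and 0x94).
$cp1252 = "\x93Hello\x94";

// Decoding the bytes as Windows-1252 yields the intended UTF-8 quotes.
$correct = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');

// Decoding the same bytes as Latin-1 yields C1 control characters instead.
$mangled = mb_convert_encoding($cp1252, 'UTF-8', 'ISO-8859-1');

echo bin2hex($correct), "\n"; // e2809c48656c6c6fe2809d
echo bin2hex($mangled), "\n"; // c29348656c6c6fc294
```

The second result, with its `\xc2\x93` and `\xc2\x94` sequences, is the kind of input the behaviour's code page 1252 map is keyed on.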
Aside from Microsoft-induced headaches, you often have little control over the markup itself. Even the best JavaScript editors don’t get everything 100% correct all the time, and as well as technological issues there is also potential for human error (typing a bare & into the source instead of its HTML entity, for example).
All in all then, you can’t really trust the markup you receive to be valid UTF-8 encoded XHTML. I found myself in this position during the development of a CMS using CakePHP, so I decided to write a Model Behaviour which can be used to clean up strings of markup to ensure they are valid and properly encoded. It ensures that the content is free of code page 1252 characters by converting them to UTF-8, replaces any unencoded special characters with their properly encoded HTML entities (e.g. & -> &amp;), fixes any invalid XHTML, and cleans and tidies the source code nicely.
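The entity handling boils down to "decode first, then encode once". The sketch below shows that idea on a plain-text fragment; `encodeOnce()` is a hypothetical helper name used only for illustration, and unlike the behaviour it makes no attempt to skip over tags.

```php
<?php
// Illustrative helper (hypothetical name): encode special characters in a
// plain-text fragment without double-encoding entities already present.
function encodeOnce($text)
{
    // First decode, so "&amp;" and a bare "&" both end up as "&" ...
    $decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // ... then encode everything exactly once.
    return htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');
}

echo encodeOnce('Fish &amp; chips & peas'), "\n"; // Fish &amp; chips &amp; peas
```

Without the decode step, the already-encoded `&amp;` would come out as `&amp;amp;`.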
The behaviour’s configuration is pretty simple. You just need to specify which fields should be automatically processed before they are saved:
```php
public $actsAs = array(
    'ValidXhtml' => array(
        'fields' => array('content')
    )
);
```
You can optionally specify whether to tidy the markup for each field using the PHP Tidy extension (default is true, so you only need to specify this to disable tidy):
```php
public $actsAs = array(
    'ValidXhtml' => array(
        'fields' => array(
            'content' => array('tidy' => false)
        )
    )
);
```
Obviously you need to have the tidy extension available to use that feature, but the Behaviour checks for the extension and will automatically configure itself accordingly, so there is no need to explicitly disable tidy if you don’t have it installed.
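The fallback is nothing more than a check for the extension before tidy is used. A minimal standalone sketch of the pattern follows; `tidyIfAvailable()` is a hypothetical helper, and the two config options are a trimmed-down assumption rather than the behaviour's full set.

```php
<?php
// Hypothetical helper: tidy the markup only when the extension is loaded.
function tidyIfAvailable($markup)
{
    if (!extension_loaded('tidy')) {
        // No tidy extension: hand the markup back untouched.
        return $markup;
    }

    $config = array('output-xhtml' => true, 'show-body-only' => true);
    $doc = tidy_parse_string($markup, $config, 'utf8');
    $doc->cleanRepair();

    // Return the repaired markup as a string rather than the tidy object.
    return tidy_get_output($doc);
}

$result = tidyIfAvailable('<p>Unclosed paragraph');
```

Either way the caller gets a string back, so the rest of the clean-up does not need to know whether tidy ran.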
```php
<?php
class ValidXhtmlBehavior extends ModelBehavior {

    private $_defaults = array(
        'fields' => array()
    );

    private $_preProcessMap = array(
        // replace empty div tags with a div containing &nbsp;
        '/<div>\s*<\/div>/i' => '<div>&nbsp;</div>'
    );

    // Map of windows 1252 character points to utf-8 character points.
    // The keys are the UTF-8 encodings of U+0080 to U+009F, i.e. what
    // code page 1252 bytes turn into when they are mistaken for latin-1.
    private $_cp1252Map = array(
        "\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
        "\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
        "\xc2\x83" => "\xc6\x92",     /* LATIN SMALL LETTER F WITH HOOK */
        "\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
        "\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
        "\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
        "\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
        "\xc2\x88" => "\xcb\x86",     /* MODIFIER LETTER CIRCUMFLEX ACCENT */
        "\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
        "\xc2\x8a" => "\xc5\xa0",     /* LATIN CAPITAL LETTER S WITH CARON */
        "\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
        "\xc2\x8c" => "\xc5\x92",     /* LATIN CAPITAL LIGATURE OE */
        "\xc2\x8e" => "\xc5\xbd",     /* LATIN CAPITAL LETTER Z WITH CARON */
        "\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
        "\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
        "\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
        "\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
        "\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
        "\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
        "\xc2\x97" => "\xe2\x80\x94", /* EM DASH */
        "\xc2\x98" => "\xcb\x9c",     /* SMALL TILDE */
        "\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
        "\xc2\x9a" => "\xc5\xa1",     /* LATIN SMALL LETTER S WITH CARON */
        "\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION */
        "\xc2\x9c" => "\xc5\x93",     /* LATIN SMALL LIGATURE OE */
        "\xc2\x9e" => "\xc5\xbe",     /* LATIN SMALL LETTER Z WITH CARON */
        "\xc2\x9f" => "\xc5\xb8"      /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
    );

    // Map of utf-8 character points to special html entities
    private $_entMap = array(
        "\xe2\x80\x98" => '&lsquo;',
        "\xe2\x80\x99" => '&rsquo;',
        "\xe2\x80\x9c" => '&ldquo;',
        "\xe2\x80\x9d" => '&rdquo;',
        "\xe2\x82\xac" => '&euro;',
        "\xe2\x80\xa6" => '&hellip;'
    );

    /* For reference, these are other entity replacement codes which might be useful one day
    array(
        "\xe2\x80\x9a" => '&sbquo;',  // Single Low-9 Quotation Mark
        "\xe2\x82\xac" => '&euro;',   // Euro sign
        "\xc6\x92"     => '&fnof;',   // Latin Small Letter F With Hook
        "\xe2\x80\x9e" => '&bdquo;',  // Double Low-9 Quotation Mark
        "\xe2\x80\xa6" => '&hellip;', // Horizontal Ellipsis
        "\xe2\x80\xa0" => '&dagger;', // Dagger
        "\xe2\x80\xa1" => '&Dagger;', // Double Dagger
        "\xcb\x86"     => '&circ;',   // Modifier Letter Circumflex Accent
        "\xe2\x80\xb0" => '&permil;', // Per Mille Sign
        "\xc5\xa0"     => '&Scaron;', // Latin Capital Letter S With Caron
        "\xe2\x80\xb9" => '&lsaquo;', // Single Left-Pointing Angle Quotation Mark
        "\xc5\x92"     => '&OElig;',  // Latin Capital Ligature OE
        "\xe2\x80\x98" => '&lsquo;',  // Left Single Quotation Mark
        "\xe2\x80\x99" => '&rsquo;',  // Right Single Quotation Mark
        "\xe2\x80\x9c" => '&ldquo;',  // Left Double Quotation Mark
        "\xe2\x80\x9d" => '&rdquo;',  // Right Double Quotation Mark
        "\xe2\x80\xa2" => '&bull;',   // Bullet
        "\xe2\x80\x93" => '&ndash;',  // En Dash
        "\xe2\x80\x94" => '&mdash;',  // Em Dash
        "\xcb\x9c"     => '&tilde;',  // Small Tilde
        "\xe2\x84\xa2" => '&trade;',  // Trade Mark Sign
        "\xc5\xa1"     => '&scaron;', // Latin Small Letter S With Caron
        "\xe2\x80\xba" => '&rsaquo;', // Single Right-Pointing Angle Quotation Mark
        "\xc5\x93"     => '&oelig;',  // Latin Small Ligature OE
        "\xc5\xb8"     => '&Yuml;',   // Latin Capital Letter Y With Diaeresis
    );
    */

    public function setup($model, $config = array()) {
        $this->settings[$model->alias] = array_merge($this->_defaults, (array) $config);
    }

    public function beforeSave($model) {
        if (!empty($this->settings[$model->alias]['fields'])) {
            foreach ($this->settings[$model->alias]['fields'] as $key => $value) {
                if (is_array($value)) {
                    // 'field' => array('tidy' => false)
                    $field = $key;
                    $options = $value;
                } else {
                    // plain field name: start from empty options so that
                    // settings from a previous field do not leak into this one
                    $field = $value;
                    $options = array();
                }
                $options['tidy'] = isset($options['tidy']) ? $options['tidy'] : true;
                if (isset($model->data[$model->alias][$field])) {
                    $model->data[$model->alias][$field] = $this->makeValid(
                        $model->data[$model->alias][$field],
                        $options['tidy']
                    );
                }
            }
        }
        return true;
    }

    public function makeValid($string, $tidy = true) {
        $string = trim($string);

        // apply the pre-process map
        $string = preg_replace(array_keys($this->_preProcessMap), $this->_preProcessMap, $string);

        // apply the windows > utf8 map
        $string = str_replace(array_keys($this->_cp1252Map), $this->_cp1252Map, $string);

        // get rid of any existing html entities to avoid double encoding
        $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

        // break out any PHP sections since they should not be touched
        $parts = preg_split('/(<\?.+?\?>)/us', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

        // replace &, ", ', < and > with their entities, but only where they are not
        // part of an html tag or a comment
        // NB: the negative lookahead in the pattern is a best-guess reconstruction
        $string = '';
        foreach ($parts as $part) {
            if (false === mb_strpos(trim($part), '<?')) {
                $string .= preg_replace_callback(
                    '/(?<=\>)((?!<\/?[a-z][^>]*[>])[^<])+/ius',
                    function ($matches) {
                        // ENT_QUOTES so that single quotes are encoded too
                        return htmlspecialchars($matches[0], ENT_QUOTES, 'UTF-8');
                    },
                    $part
                );
            } else {
                $string .= $part;
            }
        }

        // apply the utf-8 > entities map
        $string = str_replace(array_keys($this->_entMap), $this->_entMap, $string);

        // trim whitespace from the end of each line and add a nice \n
        // tinymce in particular seems to have a bug where it will insert spaces
        // at the end of lines - this can cause problems with things like Revision
        // Behavior as the values of some fields will never be the same so a revision
        // is always saved even if the data itself has not changed.
        $parts = preg_split("/[\r\n]+/u", $string);
        foreach ($parts as &$part) {
            $part = rtrim($part);
        }
        unset($part); // break the reference left behind by the loop
        $string = implode("\n", $parts);

        // tidy the output
        if ($tidy && extension_loaded('tidy')) {
            $tidy_config = array(
                'output-xhtml' => true,
                'show-body-only' => true,
                'indent' => true,
                'indent-spaces' => 4,
                'sort-attributes' => 'alpha',
                'wrap' => 80,
                'preserve-entities' => true,
                'join-styles' => false,
                'logical-emphasis' => true,
                'enclose-text' => true
            );
            $doc = tidy_parse_string($string, $tidy_config, 'UTF8');
            $doc->cleanRepair();
            // return a string, not the tidy object
            $string = tidy_get_output($doc);
        }

        return $string;
    }
}
?>
```
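The line clean-up near the end of `makeValid()` is easy to try in isolation. The sketch below pulls it out into a standalone function (`normalizeLines()` is a hypothetical name, not part of the behaviour) to show why it matters for anything that compares old and new field values.

```php
<?php
// Hypothetical helper mirroring the line clean-up step in makeValid().
function normalizeLines($string)
{
    // Split on any run of CR/LF characters, which also drops blank lines.
    $lines = preg_split("/[\r\n]+/u", $string);

    // Strip trailing whitespace from every line.
    $lines = array_map('rtrim', $lines);

    return implode("\n", $lines);
}

// Same content, different trailing spaces and line endings.
$a = normalizeLines("<p>One</p>  \r\n<p>Two</p>");
$b = normalizeLines("<p>One</p>\n<p>Two</p> ");

var_dump($a === $b); // bool(true)
```

Once both versions normalise to the same string, a revision-tracking behaviour no longer sees a change where there isn't one.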